Abstract

The purposes of the study were to compare two promising item
response theory (IRT) item selection methods, optimal and
content-optimal, with two non-IRT item selection methods introduced to
provide baseline results, random item selection and classical item
selection. The effects of the four item selection methods were
compared in three ways: (1) overlap in items selected, (2) exam
information curves, and (3) accuracy of decisions resulting from the
use of the exams.

The four item selection methods were used to construct 20-item
exams from an item pool of (approximately) 250 test items. Mastery
status on the criterion test was determined for candidates by
administering the full item pool. Three cut-off scores were also
studied: 65%, 70%, and 75%.

The results showed that the optimal exams typically provided 3 to
4 times more information near the cut-off scores than the exams
constructed with the random method. Also, the content-optimal method
produced nearly as good results as the optimal method. Classical
method results were, in general, better than the random method but not
nearly as good as the optimal methods.

The results highlighted the potential of optimal and
content-optimal item selection methods for improving the
decision-making capabilities of fixed-length certification exams. One
consequence of these results is the potential for shortening
conventionally constructed credentialing exams without losing decision
accuracy. Alternatively, with a pre-specified length for a
credentialing exam, the optimal item selection methods can improve
decision accuracy over other non-optimal item selection methods.

Optimal Item Selection with Credentialing Examinations^1,2

Ronald K. Hambleton, Dean G. Arrasmith
University of Massachusetts at Amherst

and

I. Leon Smith
Professional Examination Service

Credentialing examinations in the United States and Canada might
be described in two ways: important and lengthy. The importance of
these exams is clear when it is noted that over 900 professions now use
the results of credentialing exams to award certificates, diplomas, or
licenses. In many of these same professions, a person cannot practice
until a credentialing examination (or a recredentialing examination, in
many cases) has been passed.

Another common characteristic of credentialing exams is their
unusual length. Exams with 200 to 500 items are regularly found in
practice. The excessive lengths of many of these exams are often
defended by their developers on the grounds that high levels of content
validity and reliability are needed. Also, since credentialing exams
are rarely pilot-tested, exam developers argue that extra items are
needed so that "bad" items identified following exam administrations
can be eliminated from exam scoring without fear of shortening exam
lengths to the point where the psychometric properties of exam scores
would be unacceptable.

There appears to be a widespread belief among those who sponsor
credentialing exams (e.g., associations, agencies, etc.) that long
exams are better than short exams. But 200 to 500 exam items with 5
to 6 hours of exam administration time seems excessive. In addition,
shorter exams could be an improvement over the longer exams if the
limited exam development funds were used to improve the smaller number
of necessary exam items.

^1 Laboratory of Psychometric and Evaluative Research Report No. 1B7.
Amherst, MA: University of Massachusetts, 1987.

^2 A paper presented at the annual meeting of AERA, Washington, 1987.

Hambleton and de Gruijter (1983) and de Gruijter and Hambleton
(1983) demonstrated, using computer-simulated exam data, the advantages
of another method for improving exams that also reduces exam length:
optimal item selection. For any given exam length, the most valid exam
for separating candidates into "passes" and "failures" includes items
that discriminate effectively near the cut-off score on the exam score
scale (Lord, 1980; Lord & Novick, 1968). Such an exam is constructed
using optimal item selection (Hambleton & Swaminathan, 1985). But
credentialing exam development specialists have not usually taken
advantage of optimal items for an exam, perhaps because they are
unfamiliar with the general approach and/or with item response theory
(IRT), a test theory framework that must be understood and used in
optimal item selection.

Instead, classical item statistics are often used by exam
developers in item selection, but these statistics have limited
usefulness in constructing exams to discriminate effectively at a
cut-off score of interest. The main shortcoming is that classical item
statistics (item difficulty and discrimination indices) are defined
over a population of candidates. The cut-off score set to separate
"passes" and "failures" is defined over a domain of content.
Unfortunately, classical item statistics and the cut-off score are not
defined on the same scale; therefore the item statistics cannot be
used conveniently to select an optimal set of items for an exam.
Optimal item selection requires that item statistics and the cut-off
score be defined on the same scale. Item response theory can provide
the needed scale when an item response model can be found to fit the
exam data (Lord, 1980; Hambleton & Swaminathan, 1985).

Optimal item selection, however, is not without problems. One
problem is that when only statistical criteria are used in item
selection, there is a great risk of producing exams that lack
content validity. Computerized adaptive testing is often criticized
for the same reason. It appears that optimal item selection algorithms
will need to be modified to include content considerations to avoid
what seems to be a legitimate criticism. The effects of modifying
optimal item selection algorithms to accommodate content considerations
are unknown.

The purposes of the present paper were to compare two promising
item selection methods, optimal and content-optimal (the optimal method
modified to include content considerations), with two item selection
methods introduced to provide some baseline results, random item
selection and classical item selection. Complete details on the
methods are provided in the next section. The effects of the four item
selection methods were compared in three ways: (1) overlap in items
selected, (2) exam information curves, and (3) decision accuracy. In
addition, several cut-off scores were studied to investigate the
effects of item selection methods when the cut-off scores and
associated exam passing rates varied substantially.

Method 

Exam Item Pool 

The basic data for the study came from a certification examination
in the health field administered in 1985. A three-parameter IRT model
analysis was carried out on the exam data to provide item statistics
and corresponding item information functions for later use in the exam
development process. The item calibrations were carried out using
LOGIST (Wood, Wingersky, & Lord, 1976).

Item Selection Methods

Four item selection methods were compared:

1. Random. Exam items were selected without regard for their
   item statistics or content. (We note, however, that all
   available items had been carefully reviewed by a committee
   and judged acceptable for use in the exams.) Random item
   selection, subject usually to some content constraints, is a
   commonly used item selection method (Hambleton, 1982;
   Hambleton & Rogers, 1986).

2. Optimal. Exam items were selected which provided maximum
   information at the cut-off score of interest. Item content
   was not a factor in item selection.

3. Content-Optimal. Exam items were selected which provided
   maximum information at the cut-off score of interest subject
   to the constraint that the final version of the exam must
   meet the content specifications approved by the exam
   committee.

4. Classical. Items were selected that had (1) p-values between
   (about) .40 and .80 and (2) the highest classical item
   discrimination indices (biserial correlations). In addition,
   the exam needed to meet the content specifications approved
   by the exam committee.

For the purpose of this investigation, exams consisting of 20 items
were constructed. Exam length was kept short to minimize the overlap
with the criterion exam, which is described in the next section.

The content specifications for the criterion exam were organized
by the national committee for the specialty into a two-dimensional
grid. The percentages of items in each cell of the two-dimensional grid
were followed as closely as possible in building 20-item content-valid
exams with the content-optimal and the classical item selection
methods.
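
The optimal and content-optimal methods described above can be
sketched as follows. This is only an illustration under stated
assumptions, not the authors' program: it uses the standard
three-parameter logistic item information function with the scaling
constant D = 1.7, and the item pool, the field names ("a", "b", "c",
"cell"), the content quotas, and the function names are all
hypothetical.

```python
import math
from collections import Counter

D = 1.7  # scaling constant conventionally used with logistic IRT models


def p_3pl(theta, a, b, c):
    """Probability of a correct answer under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Three-parameter logistic item information at ability level theta."""
    p = p_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def rank_by_information(pool, theta_cut):
    """Sort items from most to least informative at the cut-off score."""
    return sorted(
        pool,
        key=lambda it: item_information(theta_cut, it["a"], it["b"], it["c"]),
        reverse=True,
    )


def select_optimal(pool, theta_cut, n_items):
    """Optimal method: the n_items most informative items, content ignored."""
    return rank_by_information(pool, theta_cut)[:n_items]


def select_content_optimal(pool, theta_cut, quotas):
    """Content-optimal method: most informative items within each content quota."""
    remaining = Counter(quotas)
    chosen = []
    for item in rank_by_information(pool, theta_cut):
        if remaining[item["cell"]] > 0:
            chosen.append(item)
            remaining[item["cell"]] -= 1
    return chosen


# Hypothetical five-item pool (not data from the study).
pool = [
    {"id": 1, "a": 1.0, "b": -1.0, "c": 0.2, "cell": "A"},
    {"id": 2, "a": 0.4, "b": -1.0, "c": 0.2, "cell": "A"},
    {"id": 3, "a": 1.0, "b": 2.0, "c": 0.2, "cell": "B"},
    {"id": 4, "a": 0.9, "b": -0.8, "c": 0.2, "cell": "A"},
    {"id": 5, "a": 0.5, "b": -1.2, "c": 0.2, "cell": "B"},
]
optimal_ids = [it["id"] for it in select_optimal(pool, -1.0, 2)]
content_optimal_ids = [
    it["id"] for it in select_content_optimal(pool, -1.0, {"A": 1, "B": 1})
]
```

In this toy pool the content constraint forces a less informative item
from cell "B" into the exam, which is the trade-off the content-optimal
method accepts in exchange for meeting the content specifications.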

Criterion Test

One of the criteria for evaluating the item selection methods was
the pass/fail decisions resulting from the administration of the
(approximately) 250-item certification exam. Of interest was the match
between pass/fail decisions based on this criterion test and pass/fail
decisions based on the 20-item exams constructed using the four item
selection methods. Since the items selected for the 20-item exams were
from the pool of items defined by the criterion test, the overlap in
exam items (albeit slight) between the short exams and the criterion
test inflates the levels of agreement between decisions based on the
20-item exams and the criterion test. Fortunately, this overlap did
not influence the results addressing the comparison of methods because
the slight positive bias in assessing agreement was common to all four
item selection methods.

Cut-off Scores

Three cut-off scores for the criterion test were considered in the
study: 65%, 70%, and 75%. These cut-off scores resulted in
(approximate) passing rates of 90%, 80%, and 50% in the sample of
(over) 1500 candidates, respectively. The corresponding cut-off scores
on the exam ability scale were -1.00, -0.50, and .125, respectively,
and were obtained using the test characteristic curve for the total set
of test items in the criterion test (Hambleton & Swaminathan, 1985). The
cut-off scores on the ability scale were the points used to build
optimal and content-optimal exams.
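
The step from a percent-correct cut-off to a point on the ability
scale can be sketched as below. The sketch assumes the test
characteristic curve is the average of three-parameter logistic item
response functions (with D = 1.7) and inverts it by bisection; the
function names, the search interval, and the ten identical items are
hypothetical and are not the study's item parameters.

```python
import math

D = 1.7  # scaling constant conventionally used with logistic IRT models


def p_3pl(theta, a, b, c):
    """Probability of a correct answer under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def tcc_proportion(theta, items):
    """Test characteristic curve expressed as expected proportion correct."""
    return sum(p_3pl(theta, a, b, c) for a, b, c in items) / len(items)


def theta_at_cutoff(proportion, items, lo=-4.0, hi=4.0, tol=1e-6):
    """Find the ability at which the expected proportion correct equals the cut-off."""
    if not tcc_proportion(lo, items) <= proportion <= tcc_proportion(hi, items):
        raise ValueError("cut-off is outside the range covered by the search interval")
    # The curve is increasing in theta, so bisection converges to the crossing point.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if tcc_proportion(mid, items) < proportion:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical pool of ten identical items, given as (a, b, c) tuples.
items = [(1.0, 0.0, 0.2)] * 10
theta_60 = theta_at_cutoff(0.60, items)
```

For these hypothetical items the expected proportion correct at an
ability of zero is exactly .60, so the cut-off maps to an ability of
(approximately) zero.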

Procedure 

For each cut-off score (65%, 70%, and 75%), 20-item optimal and
content-optimal exams were constructed. In addition, single 20-item
exams using the random and classical methods were constructed. In
total, eight 20-item exams were constructed from the available pool of
test items: optimal and content-optimal exams at each of three cut-off
scores, plus one exam constructed using the random item selection
method and one exam constructed using the classical item selection
method.

For each of the 20-item exams, candidate exam item scores were
obtained, exam ability scores were estimated, and pass-fail decisions
were made by comparing the ability estimates to the correct cut-off
score (-1.00 with the 65% cut-off score, -0.50 with the 70% cut-off
score, and .125 with the 75% cut-off score).
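
The scoring step can be illustrated with a short sketch. The paper does
not say how ability scores were estimated on the 20-item exams, so the
grid-search maximum likelihood estimate below is an assumption made
purely for illustration, as are the function names, the grid limits,
and the treatment of an estimate exactly at the cut-off as a pass.

```python
import math

D = 1.7  # scaling constant conventionally used with logistic IRT models


def p_3pl(theta, a, b, c):
    """Probability of a correct answer under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    total = 0.0
    for u, (a, b, c) in zip(responses, items):
        p = p_3pl(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total


def estimate_theta(responses, items, lo=-4.0, hi=4.0, steps=800):
    """Grid-search maximum likelihood estimate, bounded to the interval [lo, hi]."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))


def passes(responses, items, theta_cut):
    """Pass/fail decision: compare the ability estimate to the cut-off score."""
    return estimate_theta(responses, items) >= theta_cut


# Hypothetical 20-item exam with identical (a, b, c) values and 12 correct answers.
items = [(1.0, 0.0, 0.2)] * 20
responses = [1] * 12 + [0] * 8
theta_hat = estimate_theta(responses, items)
```

Bounding the grid keeps the estimate finite for candidates who answer
every item correctly or incorrectly, for whom the unbounded maximum
likelihood estimate does not exist.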

Evaluation of the Item Selection Methods

For each cut-off score and item selection method, five evaluative
criteria were of interest:

1. Percent of non-masters (as determined by the criterion test)
   who failed the 20-item exam (correct decisions) and who
   passed the 20-item exam (incorrect decisions).

2. Percent of masters (as determined by the criterion test) who
   passed the 20-item exam (correct decisions) and who failed
   the 20-item exam (incorrect decisions).

3. Overall accuracy rate (percent of candidates who were
   correctly classified).

These three statistics were calculated, first, for the total pool of
candidates, and second, for the subsets of candidates scoring near the
cut-off score. In the second analysis, only candidates scoring within
one standard error of measurement of the cut-off score (about three
score points) on the criterion test were included. The second set of
statistics was calculated because it is among candidates scoring near
the cut-off score that optimal or content-optimal item selection
methods might be expected to be the most useful. Considerable interest
is centered in exam development on this group because these candidates
are the ones who are most likely to be misclassified.

Two other criteria were also used to interpret the results:

4. The information functions for exams constructed with the four
   item selection methods.

5. The probabilities of misclassification with the various
   exams.
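
The three decision-accuracy statistics (criteria 1 to 3) can be
computed from paired pass/fail decisions as in the sketch below. The
function name, the dictionary keys, and the five example candidates are
hypothetical; the definitions follow the descriptions given in the
list.

```python
def decision_accuracy(criterion_pass, exam_pass):
    """Summarize agreement between criterion-test and short-exam decisions.

    criterion_pass[i] is True when candidate i is a master on the criterion
    test; exam_pass[i] is True when candidate i passed the short exam.
    """
    pairs = list(zip(criterion_pass, exam_pass))
    masters = [e for c, e in pairs if c]
    non_masters = [e for c, e in pairs if not c]

    def pct(count, total):
        return 100.0 * count / total if total else float("nan")

    masters_pass = sum(1 for e in masters if e)
    non_masters_fail = sum(1 for e in non_masters if not e)
    return {
        "non_masters_fail": pct(non_masters_fail, len(non_masters)),
        "non_masters_pass": pct(len(non_masters) - non_masters_fail, len(non_masters)),
        "masters_pass": pct(masters_pass, len(masters)),
        "masters_fail": pct(len(masters) - masters_pass, len(masters)),
        "overall_accuracy": pct(masters_pass + non_masters_fail, len(pairs)),
    }


# Hypothetical decisions for five candidates (not data from the study).
criterion = [True, True, True, False, False]
short_exam = [True, False, True, False, True]
summary = decision_accuracy(criterion, short_exam)
```

The same function can be applied to the constrained sample by first
keeping only the candidates whose criterion-test scores fall within one
standard error of measurement of the cut-off score.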

Results 

IRT Goodness of Fit Studies 

Tables 1 and 2 provide information concerning IRT model-exam data
fit. Unless the chosen IRT model fit the exam data, this research would
have had little merit. In fact, the fits of the one-, two-, and
three-parameter logistic models to 75 randomly chosen exam items from
the total pool of items were all quite good, though the two- and
three-parameter models fit the test data somewhat better. (The
c-parameter was set equal to .20 for all items with the three-parameter
model.) Only a subset of items was analyzed in this phase of the
research because of the limits of the LOGIST program, and our belief
that a random set of items would be quite sufficient for addressing the
model-data fit question.

Insert Tables 1 and 2 about here.

About 70% of the standardized residuals (calculated for each test
item in 12 equal-sized intervals between -3.0 and +3.0 on the ability
scale using the two- and three-parameter logistic curves) had values
less than one. Less than 1% of the standardized residuals exceeded a
value of three. Clearly, the two- and three-parameter models provided
excellent fits to the test data.
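
A sketch of how standardized residuals of this kind might be computed
for one item is given below. The paper does not spell out the formula,
so this assumes the common definition: the observed proportion correct
in an ability interval minus the model probability, divided by the
binomial standard error. Evaluating the model at the interval midpoint,
the function names, and the example data are further assumptions.

```python
import math

D = 1.7  # scaling constant conventionally used with logistic IRT models


def p_3pl(theta, a, b, c):
    """Probability of a correct answer under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def standardized_residuals(thetas, responses, a, b, c, lo=-3.0, hi=3.0, n_bins=12):
    """Standardized residuals for one item in equal-width ability intervals.

    Returns a dict mapping interval index to residual; empty intervals are skipped.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    correct = [0] * n_bins
    for theta, u in zip(thetas, responses):
        if lo <= theta < hi:
            k = min(int((theta - lo) / width), n_bins - 1)
            counts[k] += 1
            correct[k] += u
    residuals = {}
    for k in range(n_bins):
        if counts[k] == 0:
            continue
        midpoint = lo + (k + 0.5) * width
        expected = p_3pl(midpoint, a, b, c)
        observed = correct[k] / counts[k]
        std_error = math.sqrt(expected * (1.0 - expected) / counts[k])
        residuals[k] = (observed - expected) / std_error
    return residuals


# Hypothetical data: four candidates in one interval, all answering correctly.
example = standardized_residuals([0.1, 0.2, 0.3, 0.4], [1, 1, 1, 1], 1.0, 0.0, 0.2)
```

Taking absolute values of these residuals across items and intervals
and tabulating them in the ranges 0 to 1, 1 to 2, 2 to 3, and over 3
gives a summary of the kind reported in Table 1.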

Also, the misfit statistics (the standardized residuals) reported
in Table 2 for the two- and three-parameter models were not correlated
with the content categories of the exam items. The results from
the one-parameter model were very different, and had this model been
used in our later work, an oversampling of items from a few of the
content categories would have resulted. The findings in Tables 1 and 2
lent support to (1) the credibility of the unidimensionality assumption
for the full set of exam items and (2) our decision to proceed in the
research with the three-parameter model.

Parameter Estimation 

The actual LOGIST runs were carried out with the c-parameter in
the three-parameter model set to a value of .20. Table 3 provides
information pertaining to the item difficulty (b-parameter) and the
item discrimination (a-parameter) estimates for the full set of test
items. An analysis of Table 3 revealed that many items were of very
limited value in the optimal or content-optimal exams of interest,
either because they were very easy (high negative b-values) or
non-discriminating (low a-values). Also, the limited variability among
the a-parameter estimates reduced the effectiveness of the optimal and
content-optimal item selection methods. In general, the optimal item
selection methods will be most useful when there is considerable
variability among the test items in an item pool.

Insert Table 3 about here.

Overlap in the 20-Item Exams 

Table 4 shows the overlap of items in the exams constructed at
each cut-off score. Use of the optimal and content-optimal item
selection methods resulted in considerable overlap, which was to be
expected, regardless of the cut-off score. The random method, also as
expected, did not overlap to any extent with the other three methods.
The classical method overlapped moderately with the optimal and
content-optimal methods at the high cut-off score (75%) and overlapped
only slightly at the lower cut-off scores (65%, 70%). This finding seems
to indicate that when the chosen cut-off score is far from the center of
the exam score distribution (at 65% only 14% of the candidates failed),
optimal (and content-optimal) exams look very different than exams
constructed using classical methods. On the other hand, when the
cut-off score is near the center of the exam score distribution (at
75%, 48% of the candidates failed), classical methods function more
like optimal exams. In practice, however, the cut-off score for a
certification exam is seldom close to the mean exam score.

Insert Table 4 about here.

Exam Information Curves

Figure 1 provides the exam information curves for the four 20-item
exams at each cut-off score. An analysis of the exam information curves
shows, typically, that the information is 3 to 4 times greater from the
optimal and content-optimal methods than the random method. Such
improvements in exam information mean that the standard errors of
ability estimates for candidates around the cut-off score with the
optimal exams will be about 50% smaller than the standard errors
associated with exams constructed using the random method. Therefore,
substantially fewer of these candidates will be misclassified. The
differences in information functions for exams constructed with the
optimal methods and the classical method were smaller; however, the
differences were still of practical importance, especially at the two
lower cut-off scores.
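
The link between an information gain and a smaller standard error can
be checked with a few lines of arithmetic. The sketch relies only on
the relation that the standard error of ability estimation is the
reciprocal of the square root of the exam information; the function
name is hypothetical.

```python
import math


def se_reduction(information_ratio):
    """Fractional drop in standard error when information is multiplied by the ratio."""
    return 1.0 - 1.0 / math.sqrt(information_ratio)


reduction_at_3x = se_reduction(3.0)  # about 0.42
reduction_at_4x = se_reduction(4.0)  # exactly 0.50
```

A threefold to fourfold gain in information therefore corresponds to
standard errors that are roughly 42% to 50% smaller, in line with the
"about 50%" figure quoted above.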

Table 5 provides some results which address the probabilities of
misclassification for candidates with ability scores -2.5, -2.0, -1.5,
-1.0, -.5, 0, .5, and 1.0 on each of the four exams and with each of
the cut-off scores, 65%, 70%, and 75%. The probabilities were obtained
by assuming a normal distribution of ability estimates for each ability
level of interest and a standard deviation for the normal distribution
equal to the standard error of estimation associated with the exam used
to obtain the ability estimates. The standard error of ability
estimation equals $1/\sqrt{\mathrm{Info}(\theta)}$, where
$\mathrm{Info}(\theta)$ is the information provided by the exam at the
ability level of interest (Hambleton & Swaminathan, 1985).

Insert Table 5 about here.

The statistics reported in Table 5 confirm the substantial
theoretical advantages of optimal and content-optimal item selection
algorithms. For example, consider $\theta = -2.0$ and cut-off score =
65%. The probability of misclassifying the examinee using the exams
constructed with the random and classical methods is at least four
times larger than the probabilities of misclassification associated
with the two optimal item selection methods. In fact, at nearly every
ability level and for every cut-off score, the optimal and
content-optimal item selection methods produced exams that
substantially outperformed the exams constructed using the other two
item selection methods.
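
The misclassification probabilities described above can be sketched as
follows, assuming (as the text states) normally distributed ability
estimates centered on the true ability with standard deviation equal to
the standard error of estimation. The function names and the
information values in the example are hypothetical; they are not the
values behind Table 5.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def misclassification_probability(theta, theta_cut, information):
    """Chance that the ability estimate lands on the wrong side of the cut-off."""
    standard_error = 1.0 / math.sqrt(information)
    z = (theta_cut - theta) / standard_error
    if theta < theta_cut:
        # True non-master: the error is an estimate at or above the cut-off.
        return 1.0 - normal_cdf(z)
    # True master: the error is an estimate below the cut-off.
    return normal_cdf(z)


# Hypothetical exam information values at theta = -2.0 with the cut-off at -1.0.
low_information = misclassification_probability(-2.0, -1.0, 1.0)
high_information = misclassification_probability(-2.0, -1.0, 4.0)
```

When the true ability equals the cut-off score the function returns
.50 whatever the information, which matches the 50.0 entries in Table 5
at the abilities of -1.0 and -.5 for the 65% and 70% cut-off scores.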

Analysis of Decision Accuracy

Tables 6 and 7 provide summaries of the decision accuracy results
for the total and constrained samples of candidates. Results in the
tables highlight the actual decision accuracy results for the various
exams and cut-off scores.

Insert Tables 6 and 7 about here.

Though the gains in decision accuracy with optimal and
content-optimal item selection methods with the real data were modest
in size over the other two methods (they ranged from 1% to 16%), they
are of practical significance. Recall first that these improved results
were obtained without any increase in exam length. Any increases in
decision accuracy, however slight, as long as they do not involve major
new test development expenses or an excessive amount of time, would
seem worthy of serious consideration by certification boards in view of
the desirability of increasing the decision accuracy (i.e., validity)
of their exams. Improved decision accuracy resulted with the
non-masters groups especially, in part because on the average these
groups were closer to the cut-off scores. Second, rather sizeable
increases in exam length with the random and classical methods would be
required to obtain even 3% to 4% increases in decision accuracy. Using
the real data and the random method, the levels of decision accuracy as
a function of exam length for the three cut-off scores were calculated.
Even a gain in decision accuracy of 4% would require an exam
constructed with the random method which would be nearly double in
length. Thus, small gains in decision accuracy corresponded to rather
large changes in exam length.

Conclusions 

The evidence collected from this study showed that the optimal
exams typically provided 3 to 4 times more information than the random
exams and resulted in practically significant improvements in decision
accuracy. To address the legitimate complaint that optimal exams may
lack content validity, a method that balanced content with statistical
considerations was also studied. The content-optimal method also
produced very promising results. In fact, the results from this method
were almost as good as the results obtained with the optimal method, and
in a few cases the results were better. Classical method results were,
in general, better than those results obtained with the random method
but not as good as the optimal methods.

The results from this study highlight the potential of both
optimal and content-optimal item selection methods for improving the
decision-making accuracy (i.e., validity) of fixed-length credentialing
exams. If exam lengths are fixed, optimal and content-optimal methods
can lead to increased decision accuracy over non-optimal item selection
methods. Alternatively, if decision-accuracy results with the
non-optimal item selection methods are acceptable, the use of optimal
item selection methods can lead to substantially shorter exams without
any reduction in decision accuracy. This finding should be especially
important and interesting to credentialing exam boards who may wish to
shorten their exams without affecting the levels of decision accuracy
obtained from their credentialing exams constructed with non-optimal
item selection methods.

In conclusion, one final point should probably be made about the
results. Though the results from applying optimal item selection
methods in this study were positive, even more positive results are
likely to be observed in other applications. This is because optimal
item selection methods will be most effective when applied to large,
statistically diverse item pools. In this study, the item pool
consisted of relatively homogeneous test items. That is, the exam
items showed very little variability in their discriminating power.

Table 1

Summary of the Absolute-Valued Standardized Residuals
for 75 Items in the Certification Examination

            Percent of Absolute-Valued Standardized Residuals
IRT
Model       |0 to 1|     |1 to 2|     |2 to 3|     |over 3|

1-p           61.4         30.5          7.0          1.1
2-p           70.4         26.5          2.9          0.1
3-p           70.2         26.5          2.9          0.4

Table 2

Association Between Absolute-Valued
Standardized Residuals and Item Content
on 75 Items in the Certification Examination

                       Percent of Standardized Residuals
Number       1-p Model              2-p Model              3-p Model
of       SR(<.80)  SR(>.80)     SR(<.80)  SR(>.80)     SR(<.80)  SR(>.80)
Items     (n=26)    (n=45)       (n=43)    (n=28)       (n=45)    (n=26)

 40        25.0      75.0         52.5      47.5         60.0      40.0
 13        46.2      53.8         76.9      23.1         69.2      30.8
  8        87.5      12.5         62.5      37.5         62.5      37.5
  5        60.0      40.0         60.0      40.0         60.0      40.0
  5         0.0     100.0         80.0      20.0         80.0      20.0

         χ² = 65.26             χ² = 3.35              χ² = 1.08
         d.f. = 4, p < .001     d.f. = 4, p = .50      d.f. = 4, p = .90

For security reasons, the content categories cannot be identified.

Table 3

Summary of the Item Parameter Estimates^a
(c = .20)

Discrimination              Difficulty Parameter Estimates
Parameter
Estimates        < -2   -2 to -1   -1 to 0   0 to 1   1 to 2   over 2

less than .30    17.7      4.8       4.0       3.6      2.0      3.2
.30 to .60       17.3     16.5       9.6       6.4      2.4      1.2
over .60          2.8      4.8       1.2       2.0      0.4      0.0

^a Percent of test items are reported.

Table 4

Percent of Overlap in the 20-Item Exams

Cut-off   Item Selection            Item Selection Method
Score     Method                     2        3        4

65%       1. Random                  5%      10%       5%
          2. Optimal                         75%       5%
          3. Content-Optimal                          15%
          4. Classical

70%       1. Random                 10%      10%       5%
          2. Optimal                         80%      20%
          3. Content-Optimal                          25%
          4. Classical

75%       1. Random                 10%      10%       5%
          2. Optimal                         85%      30%
          3. Content-Optimal                          35%
          4. Classical

Table 5

Summary of Misclassification Probabilities
for Various Cut-off Scores, Exams, and Ability Levels^1

Cut-off                                        Ability
Score    Method            -2.5   -2.0   -1.5   -1.0    -.5    0.0    0.5    1.0

65%      Random             9.7   17.3   30.9   50.0   30.0   14.9    6.8    3.3
         Optimal            0.8    3.0   15.7   50.0   17.7    5.4    2.5    1.9
         Content-Optimal    1.6    4.2   17.0   50.0   17.4    4.7    1.6    1.0
         Classical         14.2   19.0   29.8   50.0   23.9    6.3    1.0    0.2

70%      Random             4.2    7.9   15.9   30.9   50.0   30.2   16.0    8.3
         Optimal            0.5    0.8    3.5   16.9   50.0   17.6    4.5    6.0
         Content-Optimal    0.6    1.0    4.0   17.5   50.0   18.5    5.2    4.0
         Classical          7.7    9.4   14.5   26.5   50.0   22.2    6.1    1.4

75%      Random             1.2    2.2    5.2   12.2   25.6   44.8   35.5   21.0
         Optimal            0.6    0.6    0.8    2.7   12.1   40.5   24.2    7.0
         Content-Optimal    0.6    0.6    0.7    2.7   12.5   40.8   25.1    7.7
         Classical          3.0    3.1    4.3    8.0   18.8   42.4   28.1   10.1

^1 The authors would like to thank Alison Zhou for preparing these results.

Table 6
Decision Accuracy Results
(Total Sample)

Cut-off                      Non-Masters         Masters         Overall
Score    Method              Fail     Pass     Fail     Pass     Accuracy^1

65%      Random              69.7%    30.3%    14.0%    86.0%    83.8%
         Optimal             79.6%    20.4%     9.0%    91.0%    89.4%
         Content-Optimal     81.5%    18.5%     8.1%    91.9%    90.5%
         Classical           73.0%    27.0%     9.5%    90.5%    88.1%

70%      Random              72.4%    27.6%    16.7%    83.3%    80.3%
         Optimal             80.2%    19.2%    12.5%    87.5%    85.5%
         Content-Optimal     78.8%    21.2%    12.7%    87.3%    84.9%
         Classical           77.6%    22.4%    13.1%    86.9%    84.4%

75%      Random              76.3%    23.7%    30.2%    69.8%    73.0%
         Optimal             85.9%    14.1%    23.1%    76.9%    81.2%
         Content-Optimal     85.4%    14.6%    23.1%    76.9%    80.9%
         Classical           82.2%    17.8%    22.8%    77.2%    79.6%

^1 Overall Accuracy is the percent of masters who pass and non-masters who fail
in the total sample of (over) 1500 examinees for the 20-item exams.

Table 7
Decision Accuracy Results
(Constrained Sample)

Cut-off                       Non-Masters             Masters           Overall
Score    Method             N     Fail    Pass     N     Fail    Pass   Accuracy^1

65%      Random             79   54.6%   45.4%    268   40.0%   60.0%   58.4%
         Optimal            79   68.5%   31.5%    268   35.8%   64.2%   65.6%
         Content-Optimal    79   70.4%   29.6%    268   33.7%   66.2%   67.5%
         Classical          79   59.3%   40.7%    268   34.2%   65.8%   63.8%

70%      Random            178   62.2%   37.8%    437   38.6%   61.4%   65.3%
         Optimal           178   69.3%   30.7%    437   30.2%   69.8%   69.0%
         Content-Optimal   178   67.1%   32.9%    437   30.5%   69.5%   67.9%
         Classical         178   61.6%   38.4%    437   31.9%   68.1%   66.7%

75%      Random            307   60.2%   39.8%    507   40.4%   59.6%   59.8%
         Optimal           307   73.7%   26.3%    507   35.7%   64.3%   68.2%
         Content-Optimal   307   73.1%   26.9%    507   24.6%   65.4%   68.6%
         Classical         307   67.0%   33.0%    507   36.3%   66.7%   55.2%

^1 Overall Accuracy is the percent of masters who pass and non-masters who fail in
the constrained samples for the 20-item exams.

Figure 1. Test Information Functions for the 20-Item Tests

Key: 1-Random, 2-Optimal, 3-Optimal-Content, 4-Classical